Explore and Summarize Data

1. Introduction

The dataset was created using red wine samples. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent). Several data mining methods were applied to model these datasets under a regression approach.

## [1] 1599   13

We have 13 variables and 1599 entries.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

We are unsure of the variable “X”. Let’s see what kind of values it contains.

##  [1]  1  2  3  4  5  6  7  8  9 10

This column seems more like an index than anything else, so we drop it. The structure of the remaining 12 variables is shown below.
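As a minimal sketch of how such an index column can be checked and dropped, the snippet below uses a small toy data frame rather than the real file. The object name `wine` and the toy values are assumptions made for illustration; the report does not show the code that was actually used.

```r
# Toy stand-in for the wine data (hypothetical object name and values).
wine <- data.frame(
  X = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# X is a plain row index if it simply counts the rows from 1 to n.
is_index <- identical(wine$X, seq_len(nrow(wine)))

# Drop the column once we are satisfied it carries no information.
if (is_index) {
  wine$X <- NULL
}

dim(wine)
```

Checking the column against `seq_len(nrow(wine))` before removing it guards against discarding a variable that only looks like an index in its first few values.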

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Except for quality, which is an integer (discrete), all variables are numeric (continuous).

Here’s a description of the data.

  • fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  • volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
  • citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  • residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
  • chlorides: the amount of salt in the wine
  • free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  • total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  • density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
  • pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  • sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  • alcohol: the percent alcohol content of the wine
  • quality: (score between 0 and 10)

2. Univariate Plots Section

2.1 Fixed Acidity

Acids are major wine constituents and contribute greatly to its taste. In fact, acids impart the sourness or tartness that is a fundamental feature in wine taste. Wines lacking in acid are “flat.” Traditionally total acidity is divided into two groups, namely the volatile acids (see separate description) and the nonvolatile or fixed acids. Wines that are high in acidity taste sour.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The distribution is right-skewed. The median is 7.9, and values larger than 12.35 (the third quartile plus 1.5 times the interquartile range) are outliers. Most wines have a moderate level of non-volatile acids.
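The 12.35 threshold is consistent with the usual boxplot rule, where points beyond 1.5 interquartile ranges from the quartiles are flagged. The sketch below reproduces that arithmetic from the quartiles in the summary above; treating this rule as the source of the threshold is an assumption, since the report does not state how it was obtained.

```r
# Quartiles of fixed acidity, taken from the summary output above.
q1 <- 7.10
q3 <- 9.20

# Interquartile range and the conventional 1.5 * IQR fences.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
```

The upper fence comes out at 12.35. The lower fence is 3.95, which is below the observed minimum of 4.60, so under this rule all outliers sit on the high side.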

2.2 Volatile Acidity

Volatile acidity refers to the steam distillable acids present in wine, primarily acetic acid. While acetic acid is generally considered a spoilage product (vinegar), some winemakers seek a low or barely detectable level of acetic acid to add to the perceived complexity of a wine. In addition, the production of acetic acid will result in the concomitant formation of other, sometimes unpleasant, aroma compounds.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The distribution seems right-skewed: most of the wines tested contain a low quantity of volatile acids, while some have high quantities. There are clearly a few outliers.

2.3 Citric Acid

This inexpensive supplement can be used by winemakers in acidification to boost the wine’s total acidity. It is used less frequently than tartaric and malic acid because of the aggressive citric flavors it can add to the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The distribution follows no particular pattern. It is roughly uniform up to 0.5, after which its presence falls.

2.4 Residual Sugar

Residual Sugar, or RS for short, refers to any natural grape sugars that are leftover after fermentation ceases (whether on purpose or not). The juice of wine grapes starts out intensely sweet, and fermentation uses up that sugar as the yeasts feast upon it. So if the wine has sugar you will probably want strong acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The distribution is right skewed with a large number of outliers. It seems that the residual sugar is mainly concentrated around a low value.

2.5 Chlorides

The amount of salt is higher in wines from vineyards near the sea coast, with brackish subsoil, or on arid ground irrigated with salt water. The molar ratio of Cl-/Na+ therefore varies significantly and can even be close to one, which could imply the addition of salt (NaCl) to the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The distribution seems right-skewed with heavy outliers. The salt content is also quite low.

2.6 Free Sulfur Dioxide

In the wine industry, sulfur dioxide (SO2) is frequently added to must and juice as a preservative to prevent bacterial growth and slow down the process of oxidation by inhibiting oxidation enzymes. SO2 also improves the taste and retains the wine’s fruity flavors and freshness of aroma.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The graph seems right skewed.

2.7 Total Sulfur Dioxide

Total sulfur dioxide is the amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The distribution seems right-skewed, with a median of 38 (less than 50) and a mode of 20.

2.8 Density

The density of the wine depends on the percent alcohol and sugar content.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The graph seems normally distributed.

2.9 pH

pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The graph seems normally distributed.

2.10 Sulphates

A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The graph seems right skewed.

2.11 Alcohol

The percent alcohol content of the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The graph does not follow a pattern.

2.12 Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The distribution looks roughly normal, with a mode of 5.
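The skewness remarks in this section can be cross-checked against the summaries without any plot, because a mean noticeably above the median hints at a right skew. The sketch below applies that heuristic to a few of the variables, using the median and mean values printed above. The 5% cutoff and the column names `rel_gap` and `hint` are arbitrary choices made for this illustration, not part of the original analysis.

```r
# Median and mean copied from the summary outputs in this section.
stats <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "chlorides", "density", "pH"),
  median   = c(7.90, 2.200, 0.07900, 0.9968, 3.310),
  mean     = c(8.32, 2.539, 0.08747, 0.9967, 3.311)
)

# Gap between mean and median, relative to the median.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Assumed cutoff: a mean more than 5% above the median is read as a skew hint.
stats$hint <- ifelse(stats$rel_gap > 0.05, "right-skew hint", "roughly symmetric")

stats[, c("variable", "rel_gap", "hint")]
```

With this cutoff, fixed acidity, residual sugar and chlorides are flagged, while density and pH are not, which agrees with the descriptions given for those variables. The heuristic is only a rough guide and does not replace looking at the histograms.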

2.13 Univariate Analysis

What is the structure of your dataset?

There are 1599 entries with 13 features (12 + 1 added as rating).

What is/are the main feature(s) of interest in your dataset?

There are quite a few, such as the balance of acidity, residual sugar and chlorides, which shapes the taste of the wine, and how other factors like density and pH vary.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The various forms of acid (volatile, non-volatile and citric) may contribute to the features of interest. Alcohol may also influence pH.

Did you create any new variables from existing variables in the dataset?

Yes. I created a rating variable based on the existing feature called quality. 3-4: Bad, 5-6: OK, 7-8: Good.
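One way to derive such a rating in R is with `cut()`, which bins a numeric vector into labelled intervals. This is a sketch under the assumption that binning was done along these lines; the report does not show its code, and the vector below is a toy example covering each observed quality score once.

```r
# One example of each quality score that occurs in the data (3 to 8).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Intervals are closed on the right by default: (2,4], (4,6], (6,8].
rating <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("Bad", "OK", "Good")
)

data.frame(quality, rating)
```

Because the labels are given in order, the resulting factor levels run from Bad to Good, which keeps legends and tables in a sensible order.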

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There may not be any unusual distributions; however, some graphs do not follow any regular pattern. This may be due to some lurking variables. Since the data provided is promised to be clean, I did not perform any wrangling.

3. Bivariate Plots Section

Before we dig into the bivariate plots, it is worth looking at the correlation matrix, which helps identify potentially related variables that may be interesting to analyze.
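A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses only the first ten rows of four variables, copied from the structure output in the introduction, so its coefficients will not match those of the full dataset. The object names `wine_head`, `cor_matrix` and `pairs` are assumptions made for illustration.

```r
# First ten observations of four variables, from the str() output above.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation between every pair of columns.
cor_matrix <- round(cor(wine_head), 2)

# Flatten the upper triangle into one row per variable pair.
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Strongest relationships first, regardless of sign.
pairs[order(-abs(pairs$r)), ]
```

Sorting by absolute value puts strong negative pairs next to strong positive ones, which makes it easier to shortlist pairs for plotting.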

Some potential positively related variable pairs are:

  • citric.acid and fixed.acidity
  • density and fixed.acidity
  • density and citric.acid
  • density and residual.sugar
  • sulphates and chlorides
  • quality and alcohol

Some potential negatively related variable pairs are:

  • pH and fixed.acidity
  • citric.acid and volatile.acidity
  • pH and citric.acid
  • alcohol and density
  • quality and volatile.acidity

Let us now visualize how the above pairs will look when plotted.

3.1 Citric Acid VS Fixed Acidity

## [1] "The co-relation coefficient is  0.671703434764106"

This trend is expected, as citric acid is added to increase total acidity. So where citric acid is high, the non-volatile acidity should also be high.

3.2 Density VS Fixed Acidity

## [1] "The co-relation coefficient is  0.668047292118974"

This is an interesting correlation. We know that wines with high residual sugar also have a high acid content, which makes them more appealing to the taste buds. Density and residual sugar are also positively correlated, which makes the above plot plausible.

3.3 Density VS Citric Acid

## [1] "The co-relation coefficient is  0.364947175211251"

This is a weak correlation. Based on the previous plot, it is possible that high-density wines have a high natural acid content rather than artificially added citric acid.

3.4 Density VS Residual Sugar

## [1] "The co-relation coefficient is  0.355283370983376"

As noted before, one of the factors behind high density is residual sugar. This graph supports that, although the correlation coefficient is fairly low.

3.5 Quality VS Alcohol

## [1] "The co-relation coefficient is  0.476166324001136"

Wines with a higher percentage of alcohol tend to be rated higher, although with a coefficient below 0.5 the relationship is only moderate.

3.6 pH VS Fixed Acidity

## [1] "The co-relation coefficient is  -0.341699334785031"

This is as expected: the highest fixed acidity goes with the lowest pH values, and vice versa.

3.7 Citric Acid VS Volatile Acidity

## [1] "The co-relation coefficient is  -0.55249568455958"

To boost the wine’s total acidity, either citric acid or volatile acid is added, as a trade-off rather than both at once. So there is a clear trend here. Also, volatile acidity mostly stays within 0.6.

3.8 pH VS Citric Acid

## [1] "The co-relation coefficient is  -0.54190414473951"

This is again as expected: the highest citric acid content goes with the lowest pH values, and vice versa.

3.9 Alcohol VS Density

## [1] "The co-relation coefficient is  -0.496179770241702"

Alcohol is less dense than water, so wines with more alcohol tend to be less dense. The negative correlation is therefore expected.

3.10 Quality VS Volatile Acidity

## [1] "The co-relation coefficient is  -0.390557780264007"

Volatile acidity is undesirable as it induces a bad taste. So, less volatile acid means better-quality wine, and vice versa.

4. Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Density shows a positive trend with both acid and residual sugar, as it should, since these chemicals add to the density. Also, a greater quantity of alcohol goes with better wine quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes: pH falls as acidity rises, and quality falls as volatile acidity rises (high volatile acidity imparts a bad taste).

What was the strongest relationship you found?

Fixed acidity vs citric acid is the strongest relationship I found, with a Pearson coefficient of 0.67.

5. Multivariate Plots Section

5.1 Dispersion of Quality amongst Citric Acid and Fixed Acidity

Let us see how quality is dispersed among citric acid and fixed acidity.

We can see from this diagram that the darker shades are in the bottom section of the graph, making it clear that bad-quality wines have low to medium contents of both acids. That also suggests good winemakers add citric acid in the right proportions to pull up the acidic content.

5.2 Dispersion of Quality amongst Density and Fixed Acidity

We can see that the good quality wines have low density but high acidic contents.

5.3 Dispersion of Quality amongst Density and Residual Sugar

Again we see that good wines (the lighter blues) have low density but high residual sugar content.

5.4 Dispersion of Quality amongst Alcohol and Density

Wow! This is the nicest trend. The bad-quality wines have less alcohol and are denser, while the opposite (high alcohol and lower density) is reserved for the good wines, which also tend to have a balanced pH (preferably low).

6. Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

What we guessed from the very start about the presence of acidity and residual sugar in good-quality wines is supported. Good-quality wines also have more alcohol and are less dense, with comparatively low pH.

Were there any interesting or surprising interactions between features?

Winemakers tend to use citric acid to increase the overall acidity rather than relying on the inherent non-volatile acidity.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No.

7. Final Plots and Summary

7.1 Plot One

7.2 Description One

This is the most promising plot supporting the argument we have made from the very beginning: “Good wines have high alcohol content and lower density with medium pH.”

7.3 Plot Two

7.4 Description Two

We can see that the clustering of good wines is near a place where the density is low but the acidity is high. This supports our previous argument that good wines have high acidic content.

7.5 Plot Three

7.6 Description Three

This plot supports the argument that winemakers use citric acid to pull up the acidic content of the wine when volatile acidity is low. So, if the natural acidity is in good proportion, artificial citric acid is not added.

8. Reflection

This dataset has details of 1599 wines described by twelve features, from around 2009. I first did single-variable analysis on the dataset, thus understanding the basic features of the wines and taking the first step in a skeptical data exploration journey. Next I visited the interesting variables in pairs and noted down the possible trends, trying to identify the reasons that determine the quality of wines.

Finally I ended up with the argument that good wines have a high alcohol content, low density and medium pH.

The first hurdle I faced while developing the project was that I was getting a single color whose tone varied across the different wine qualities. This was challenging to interpret, as similar colors are hard to distinguish. So, I converted the discrete variable quality to a factor for plotting. The next problem was that six colors were still not visually appealing. So, I created a new variable, rating, and based on the quality rated the wines as Good, OK or Bad. This made the plots easy to interpret and better looking.
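The difference between the two encodings can be seen by counting levels. In the sketch below, a toy vector stands in for the quality column; converting it to a factor yields one discrete level per score, and binning it into ratings yields three. The cut points follow the rating definition given in the univariate analysis, while the variable names and toy values are assumptions.

```r
# Toy quality scores covering every value that occurs in the data.
quality <- c(5L, 5L, 6L, 7L, 3L, 4L, 8L, 6L)

# As a number, quality maps to a continuous color gradient in most plotting tools.
# As a factor, each score becomes its own discrete category.
quality_factor <- factor(quality)
nlevels(quality_factor)

# Collapsing to three ratings reduces the number of colors a plot needs.
rating <- cut(quality, breaks = c(2, 4, 6, 8), labels = c("Bad", "OK", "Good"))
table(rating)
```

Six levels are still a lot to tell apart by color, which is why the three-level rating tends to be easier to read in the multivariate plots.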

This exploration can even be boosted by using functions such as SelectKBest to know which features contribute most to predicting the quality of wine. Then we can build one or two classifiers and make a good wine predicting model for the future.
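As a rough stand-in for that idea in R, features can be ranked by the absolute value of their correlation with quality and the top k kept. This is not SelectKBest itself, only a simple illustration of univariate feature ranking. It uses the first ten rows shown in the structure output, so the ranking it produces says nothing reliable about the full dataset. The names `wine_head`, `scores` and `top_k`, and the choice of k = 2, are assumptions.

```r
# First ten observations of three inputs and the target, from the str() output.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Score each input by the strength of its linear relationship with quality.
features <- setdiff(names(wine_head), "quality")
scores <- sapply(features, function(f) {
  abs(cor(wine_head[[f]], wine_head$quality))
})

# Keep the k highest-scoring features (k is an arbitrary choice here).
k <- 2
ranked <- sort(scores, decreasing = TRUE)
top_k <- names(ranked)[seq_len(k)]
top_k
```

A correlation-based score only captures linear, one-variable-at-a-time effects, so it is best treated as a first filter before fitting and validating an actual model.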